SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and streams MoE experts from NVMe SSD.
Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.
Running LTX-2 and Wan 2.2 on an M1 Max 64GB. FP8 doesn't work on Metal, so GGUF is used as a workaround. Wan 2.2 takes 82 minutes for a 2-second video. LTX-2's official pipeline produces NaN on MPS, and the KSampler fallback doesn't reach usable quality.
Upscaling images loaded via the Load Image node was producing garbled output. Fixed by making the tensor contiguous before upscaling, a one-line patch to comfy/utils.py.
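The failure mode generalizes beyond ComfyUI: a kernel that assumes row-major memory reads a non-contiguous view in the wrong order and produces scrambled pixels. A minimal NumPy sketch of the same issue (the actual patch uses PyTorch, and the exact change to comfy/utils.py isn't reproduced here):

```python
import numpy as np

# A transposed view is non-contiguous: its strides no longer match
# row-major memory order, so code that walks the raw buffer
# (e.g. a naive upscale kernel) visits pixels in the wrong order.
img = np.arange(12, dtype=np.float32).reshape(3, 4)
view = img.T                        # non-contiguous view, no copy
print(view.flags["C_CONTIGUOUS"])   # False

# The fix: copy into contiguous memory before handing the buffer on.
fixed = np.ascontiguousarray(view)
print(fixed.flags["C_CONTIGUOUS"])  # True
print(np.array_equal(view, fixed))  # True: same values, safe layout
```

In PyTorch the equivalent one-liner is `tensor.contiguous()`, which is a no-op when the tensor is already contiguous.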
Overview of Black Forest Labs' FLUX.2 Klein 9B model and how it performs on M1/M2/M3/M4 Macs. Covers the key factors behind the CUDA vs MPS performance gap, including memory bandwidth and FP8 quantization.
A plan to build an internal help desk RAG system using a Mac mini M4 Pro and Dify. Highlights what's new in Dify circa 2025 and tips for running local LLMs.
Introducing a Mac mini M4 Pro to build an in-house RAG system. A plan to set up a LoRA training environment during the downtime while the system requirements are being finalized.